$$ Project \ 4: \ Feature \ Engineering, \ Model \ Selection, \ Model \ Tuning \ $$

$$ Concrete \ Compressive \ Strength \ Prediction $$

$$ (A \ Regression \ Problem) $$

$Objective:$ To predict the Concrete Compressive Strength using the data available in file concrete_data.xls. Apply Feature Engineering and Model Tuning to obtain $80\% \ to \ 95\% \ of \ R^2 \ Score$.

$Derived \ Project \ Conclusion$: Test Data SCORE: 0.9161131402242251 ***(91.61%)***: On the UNSEEN / Reserved Test Data : BEST Final SCORE on high end of Goal Range: 80% to 95% : Objective Achieved !!!

$Resources \ Available:$ The data for this project is available in file https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/. The same has been shared along with the course content.


Given Information AS Per Project Doc & Data Resource URL Above:


  • ### $Data \ Characteristics$: Data is in ***RAW Form (NOT Scaled).*** The actual concrete compressive strength (MPa) for a given mixture under a specific age (days) was determined from laboratory.

  • ### $Attribute \ Information$: The order of this listing corresponds to the order of numerals along the rows of the database:
No. Name Meaning Data Name Short Data Name Analytic Type Data Type Unit Of Measure Variable Type Description / Comments
1 Cement An Ingredient cement cement Quantitative Numeric Kg/m3 mix Input Predictor Variable
2 Blast Furnace Slag An Ingredient slag slag Quantitative Numeric Kg/m3 mix Input Predictor Variable
3 Fly Ash An Ingredient ash ash Quantitative Numeric Kg/m3 mix Input Predictor Variable
4 Water An Ingredient water water Quantitative Numeric Kg/m3 mix Input Predictor Variable
5 Superplasticizer An Ingredient superplastic splast Quantitative Numeric Kg/m3 mix Input Predictor Variable
6 Coarse Aggregate An Ingredient coarseagg corse Quantitative Numeric Kg/m3 mix Input Predictor Variable
7 Fine Aggregate An Ingredient fineagg fine Quantitative Numeric Kg/m3 mix Input Predictor Variable
8 Age Days from Casting age age Quantitative Numeric Day (1...365) Input Predictor Variable
9 Concrete Compressive Strength Measure of Strength strength mpa Quantitative Numeric MPa (Mega Pascal) OUTPUT PREDICTED Variable
  • Added By MSB (Project Coder/Developer) Above^: New Info added to the table above: Short Data Name , Data Type , Comments


  • ### $The \ Concrete \ Compressive \ Strength \ Prediction$: It is a Multi Variate Regression Problem having:

    • Eight (8) Input (Predictors) variables (Data Columns)
    • One (1) OUTPUT (PREDICTED) variable (Data Column)
    • Nine (9) Total Attributes (Data Columns)
    • 1,030 Total Observarions (Data Rows)
    • No (Zero) Missing Attributes Values (Data Columns Values)

Project Rubric / Scoring Guide: Concrete Compression Strength Prediction

Points / Marks By Criteria:
  • 10 = Univariate Analysis
  • 10 = Bivariate Analysis
  • 10 = Feature Engineering
  • 15 = Modelling
  • 15 = Hyper Parameter Tuning
  • 60 = Total Points

Project Steps & Tasks: Deliverables: Concrete Compression Strength Prediction

*D1: Exploratory Data Quality Report reflecting the following:*

  • ###### D1.1: Univariate Analysis: 10 Marks:
    Data types and description of the independent attributes which should include (name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers.

  • ###### D1.2: Bi-variate Analysis: 10 Marks:
    Analyze Among Predictor Variables and Between Predictor & Target Columns. Comment on your findings in terms of their Relationship and Degree of Relation if any. Visualize the analysis using Boxplots and Pair Plots with Histograms or Density Curves.

  • ###### D1.3: Feature Engineering Techniques: 10 Marks:

    • $D1.3.a:$ Identify opportunities (if any) to Extract a New Feature from existing features and/or Drop a Feature (if required).
    • $D1.3.b:$ Get data model ready and do a train test split.
    • $D1.3.c:$ Decide on complexity of the model, should it be Simple Linear Model in terms of Parameters or would it be a Quadratic or Higher degree.

*D2: Creating the Model and Tuning it:*

  • ###### D2.1: Algorithms: 15 Marks:
    Use the Algorithms that you think will be suitable for this project (at least 3 algorithms). Use ***Kfold Cross Validation*** to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics.
  • ###### D2.2: Techniques: 15 Marks:
    Employ Techniques to squeeze that extra performance out of the model without making it over fit. Use ***Grid Search or Random Search*** on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above.
In [1422]:
# v====== Standard Libraries Begin ======v #

import warnings
warnings.filterwarnings('ignore')

import numpy as np  # Numerical Python libraries
# random_state = np.random.RandomState(0)  # From Mukesh Rao. MSB: Do we need to do this? Working ok without it.

import pandas as pd  # to handle data in form of rows and columns
import pandas_profiling 

import pylab as pl  # Mukesh Rao  

import seaborn as sns  # Data visualization for statistical graphics  
import matplotlib.pyplot as plt  # Data visualization for Ploting  

from sklearn.svm import SVC  # M.Rao; SVC = Support Vector Classification
from sklearn import metrics, svm  # "svm" = Support Vector Machine > MRao; For Lin/Log Regr, DTree  
from sklearn.impute import SimpleImputer  
from sklearn.utils import resample, shuffle  # "shuffle"=> Mukesh Rao; Bagging Sample data set creation  
from sklearn.model_selection import train_test_split, KFold, cross_val_score, LeaveOneOut, GridSearchCV, RandomizedSearchCV  # Lin/LogR, DTree  
from sklearn.pipeline import Pipeline, make_pipeline  # M.Rao
from sklearn.neighbors import KNeighborsClassifier  # MRao

# For Linear Dimensionality (Cols/Attributes) Reduction to a Lower dimensional space (eg: reduce 15 cols to 2 cols):  
from sklearn.decomposition import PCA  # "Principal Component Analysis" for "Singular Value Decomposition" (SVD)

# ClusterCentroids=Cluster based UNDERsampling, TomekLinks=Under sampling by Deleting nearest majority neighbor/similar rows
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, TomekLinks 
from imblearn.over_sampling import SMOTE  # Over sampler  
from imblearn.combine import SMOTETomek  # OVER / UP Sampling followed by UNDER / DOWN Sampling

from mlxtend.feature_selection import SequentialFeatureSelector as sfs  # For Features selection  
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs  # For Plotting  

# ====== For Linear Regression ======

from scipy.stats import zscore, pearsonr, randint as sp_randint

from category_encoders import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures, binarize, LabelEncoder, OneHotEncoder # M.Rao

from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge 

# Import Linear Regression machine learning library: 
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  # LinRegr

import statsmodels.api as sm  # For OLS Summary in Linear Regression  
import statsmodels.formula.api as smf  # For OLS Summary in Linear Regression  

from yellowbrick.regressor import ResidualsPlot
from yellowbrick.classifier import ClassificationReport, ROCAUC

# ====== For Logistic Regression ======
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score
from sklearn.metrics import f1_score, roc_curve, roc_auc_score, classification_report, auc # Mukesh Rao  

# ====== For Decision Tree ======
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
# from sklearn.externals.six import StringIO  # Discontinued in Scikit Version 0.23 (available only upto Ver 0.22)
import pydotplus as pdot  # to display decision tree inline within the notebook
import graphviz as gviz

# DTree does not take strings as… # … input for the model fit step....
from sklearn.feature_extraction.text import CountVectorizer 

# ======= For Ensemble Techniques =======
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingRegressor,  RandomForestRegressor,  AdaBoostRegressor,  GradientBoostingRegressor

# ======= Set default style ========

# Multiple output displays per cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import Image, Markdown
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>")) # Increase cell width

# ===== Options =====

pd.options.display.float_format = '{:,.2f}'.format  # Remove scientific notations to display numbers with 2 decimals
pd.set_option('display.max_columns', 100)  # Max df cols to display set to 100.
pd.set_option('display.max_rows', 50)  # Max df rows to display set to 50.
# pd.set_option('display.max_rows', tdf.shape[0]+1)  # just one row more than the total rows in df

# Update default style and size of charts
plt.figure(figsize=(12,8))
plt.style.use('ggplot')  # plt.style.use('classic') ?? 
plt.rcParams['figure.figsize'] = [10, 8]

sns.set_style(style='darkgrid')
%matplotlib inline

import pickle  # For model export
from os import system  # For system (eg MacOS, etc) commands from within python

# ====== Standard Libraries End ======^ #
Out[1422]:
<Figure size 864x576 with 0 Axes>
In [58]:
# Read & Load the input Datafile into dataset frame: Concrete DataFrame: 
cdf = pd.read_csv('concrete.csv')
cdf
Out[58]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.30 212.00 0.00 203.50 0.00 971.80 748.50 28 29.89
1 168.90 42.20 124.30 158.30 10.80 1,080.80 796.20 14 23.51
2 250.00 0.00 95.70 187.40 5.50 956.90 861.20 28 29.22
3 266.00 114.00 0.00 228.00 0.00 932.00 670.00 28 45.85
4 154.80 183.40 0.00 193.30 9.10 1,047.40 696.70 28 18.29
... ... ... ... ... ... ... ... ... ...
1025 135.00 0.00 166.00 180.00 10.00 961.00 805.00 28 13.29
1026 531.30 0.00 0.00 141.80 28.20 852.10 893.70 3 41.30
1027 276.40 116.00 90.30 179.60 8.90 870.10 768.30 28 44.28
1028 342.00 38.00 0.00 228.00 0.00 932.00 670.00 270 55.06
1029 540.00 0.00 0.00 173.00 0.00 1,125.00 613.00 7 52.61

1030 rows × 9 columns

In [65]:
# My Housekeeping: Incremental DF Data Backup 0 as of now: 
cdf0 = cdf.copy()  # Original Df

cdf.to_csv('cdf0.csv')  # Also export as .csv file to disk
! ls -l cdf*

# Verify backup copy
cdf0.shape, type(cdf0)
cdf0.sample(7)
-rw-r--r--  1 RiddhiSiddhi  staff  52485 Jun 30 01:52 cdf0.csv
Out[65]:
((1030, 9), pandas.core.frame.DataFrame)
Out[65]:
cement slag ash water superplastic coarseagg fineagg age strength
581 525.00 0.00 0.00 189.00 0.00 1,125.00 613.00 180 61.92
316 218.90 0.00 124.10 158.50 11.30 1,078.70 794.90 28 30.22
638 186.20 124.10 0.00 185.70 0.00 1,083.40 764.30 7 8.00
233 310.00 0.00 0.00 192.00 0.00 1,012.00 830.00 28 27.83
7 251.40 0.00 118.30 188.50 6.40 1,028.40 757.70 56 36.64
1008 213.50 0.00 174.20 154.60 11.70 1,052.30 775.50 100 59.30
687 168.90 42.20 124.30 158.30 10.80 1,080.80 796.20 100 48.15
In [66]:
# My Housekeeping: Rename column names for convenience, meaningfulness and intuitiveness:  

cdf.rename(columns={'superplastic': 'splast', 'coarseagg': 'corse', 'fineagg': 'fine', 'strength': 'mpa'}, 
           inplace=True, errors='raise')
cdf.sample(6)
Out[66]:
cement slag ash water splast corse fine age mpa
1027 276.40 116.00 90.30 179.60 8.90 870.10 768.30 28 44.28
179 296.00 0.00 0.00 192.00 0.00 1,085.00 765.00 7 14.20
725 500.00 0.00 0.00 200.00 0.00 1,125.00 613.00 28 44.09
464 255.50 170.30 0.00 185.70 0.00 1,026.60 724.30 28 32.05
34 331.00 0.00 0.00 192.00 0.00 1,025.00 821.00 90 37.91
699 475.00 118.80 0.00 181.10 8.90 852.10 781.50 28 68.30
In [72]:
# My Housekeeping: Incremental DF Data Backup 1 as of now: 
cdf1 = cdf.copy()  # Modified Df: Changed Columns Names to shorten: 'splast', 'corse', 'fine', 'mpa'

cdf1.to_csv('cdf1.csv')  # Also export as .csv file to disk
! ls -l cdf*

# Verify backup copy
cdf1.shape, type(cdf1)
cdf1.sample(6)
-rw-r--r--  1 RiddhiSiddhi  staff  52485 Jun 30 01:52 cdf0.csv
-rw-r--r--  1 RiddhiSiddhi  staff  52467 Jun 30 02:27 cdf1.csv
Out[72]:
((1030, 9), pandas.core.frame.DataFrame)
Out[72]:
cement slag ash water splast corse fine age mpa
17 336.00 0.00 0.00 182.00 3.00 986.00 817.00 28 44.86
460 349.00 0.00 0.00 192.00 0.00 1,047.00 806.00 28 32.72
940 153.60 144.20 112.30 220.10 10.10 923.20 657.90 28 16.50
234 522.00 0.00 0.00 146.00 0.00 896.00 896.00 7 50.51
173 276.00 116.00 90.00 180.00 9.00 870.00 768.00 28 44.28
56 182.00 45.20 122.00 170.20 8.20 1,059.40 780.70 100 48.67

D1.1: Univariate Analysis: Begins below: vvv

Data types & description of the independent attributes which should include (name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers.

Note: Data Name, Type, Meaning, Description has been tabulated above in Header Setion at the Top^^^
In [68]:
profile = pandas_profiling.ProfileReport(cdf)
profile








Out[68]:

In [153]:
### My Housekeeping: Jupyter Code File Incremental Backup 1 Tue.Jun.30 2:47am ^^^
Markdown("### Incremental Jupyter Notebook Code Backup 1")
# ! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 1.ipynb" 
! ls -l Project*.ipynb
Out[153]:

Incremental Jupyter Notebook Code Backup 1

-rw-r--r--  1 RiddhiSiddhi  staff  6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  6745802 Jul  5 21:45 Project 4 FEMST Concrete Strength Predict.ipynb
In [569]:
Markdown('### * DF Head & Tail rows:')
cdf

Markdown('### * DF Random Sample rows:')
cdf.sample(7)

Markdown('### * DF Shape: Number of (Rows, Columns), DF Type:')
cdf.shape, type(cdf)  

Markdown('### * DF Info with: Column Names & Data Types')
cdf.info()  

Markdown('### * DF Stats for Numerical value Cols: Range (Min & Max), Central values (Mean), Std.D, Quartiles')
cdf.describe()  

Markdown('### * DF Stats for Numerical value Cols: Central values: MEDIAN which are not in Standard "Describe" above^:')
cdf.median()  

Markdown('### * DF Stats for All Cols: Central values: MODE which are not in Standard "Describe" above^:')
cdf.mode()  

Markdown(""" ### * $ \ {dup}$ = Duplicate <u>Columns</u> Based on ALL Rows """.format(dup=cdf.T.duplicated().sum()))  

Markdown(""" ### * ${dup}$ = Duplicate <u>Rows</u> Based on ALL 9 Columns """.format(dup=cdf.duplicated().sum()))  

Markdown('### * DF Duplicated <u>Rows</u> Based on CERTAIN "Mixture Ingredients" <u>Columns</u>:')  

print(' ', cdf.iloc[:,:8].duplicated().sum(), 
      '= Dup. Rows for first 8 Cols: cement, slag, ash, water, splast, corse, fine, age. (Except: "mpa")') 
      
print('', cdf.iloc[:,:7].duplicated().sum(), 
      '= Dup. Rows for first 7 Cols: cement, slag, ash, water, splast, corse, fine. (Except: "age" & "mpa")') 

print(' ', cdf.duplicated(['cement', 'slag', 'ash', 'water', 'splast', 'corse', 'fine', 'mpa']).sum(), 
      '= Dup. Rows for first 7 & "mpa" Cols: cement, slag, ash, water, splast, corse, fine, mpa. (Except: "age")') 

Markdown('### * DF Unique Values for All Columns:')
cdf.nunique()

Markdown('### * DF Null values for All Columns:')
cdf.isna().sum()

Markdown('### * DF NON Numeric Values in Numerical Columns:')
cdf[~cdf.select_dtypes(include='number').applymap(np.isreal).all(1)].count()  # With "~" for NOT Real Numbers (Non Numeric)

Markdown('### * DF Zero Values in Numerical Columns:')
(cdf.select_dtypes(include='number') == 0).sum()

Markdown('### * DF Negative (-ve) Values in Numerical Columns:')
(cdf.select_dtypes(include='number') < 0).sum()

Markdown('### * DF Skewness for All Numerical Columns:')
cdf.skew()

Markdown('###### * Note: Skew Categories (Arbitrary): ')  
print('''If distribution Skewness is between following ranges then Skewness for Column is...:  
* High     : < −1  OR > +1               Asymmetric  Col : age  
* Moderate : −1    &  −0.5 OR +0.5 & +1  Asymmetric  Cols: cement, slag, ash, splast  
* Low      : −0.5  &  +0.5               Asymmetric  Col : mpa  
* V.Low    : −0.25 &  +0.25              Symmetric   Col : fine  
* No Skew  :  0.0  OR near -0.0+         Symmetric   Cols: water, corse  
''')  
Out[569]:

* DF Head & Tail rows:

Out[569]:
cement slag ash water splast corse fine age mpa
0 141.30 212.00 0.00 203.50 0.00 971.80 748.50 28 29.89
1 168.90 42.20 124.30 158.30 10.80 1,080.80 796.20 14 23.51
2 250.00 0.00 95.70 187.40 5.50 956.90 861.20 28 29.22
3 266.00 114.00 0.00 228.00 0.00 932.00 670.00 28 45.85
4 154.80 183.40 0.00 193.30 9.10 1,047.40 696.70 28 18.29
... ... ... ... ... ... ... ... ... ...
1025 135.00 0.00 166.00 180.00 10.00 961.00 805.00 28 13.29
1026 531.30 0.00 0.00 141.80 28.20 852.10 893.70 3 41.30
1027 276.40 116.00 90.30 179.60 8.90 870.10 768.30 28 44.28
1028 342.00 38.00 0.00 228.00 0.00 932.00 670.00 270 55.06
1029 540.00 0.00 0.00 173.00 0.00 1,125.00 613.00 7 52.61

1030 rows × 9 columns

Out[569]:

* DF Random Sample rows:

Out[569]:
cement slag ash water splast corse fine age mpa
627 387.00 20.00 94.00 157.00 13.90 938.00 845.00 3 25.51
556 500.00 0.00 0.00 200.00 0.00 1,125.00 613.00 90 47.22
571 157.00 236.00 0.00 192.00 0.00 935.40 781.20 28 33.66
862 310.00 0.00 0.00 192.00 0.00 1,012.00 830.00 120 38.70
134 339.00 0.00 0.00 197.00 0.00 968.00 781.00 3 13.22
439 255.00 0.00 0.00 192.00 0.00 889.80 945.00 3 8.20
715 303.60 139.90 0.00 213.50 6.20 895.50 722.50 28 33.42
Out[569]:

* DF Shape: Number of (Rows, Columns), DF Type:

Out[569]:
((1030, 9), pandas.core.frame.DataFrame)
Out[569]:

* DF Info with: Column Names & Data Types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   cement  1030 non-null   float64
 1   slag    1030 non-null   float64
 2   ash     1030 non-null   float64
 3   water   1030 non-null   float64
 4   splast  1030 non-null   float64
 5   corse   1030 non-null   float64
 6   fine    1030 non-null   float64
 7   age     1030 non-null   int64  
 8   mpa     1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
Out[569]:

* DF Stats for Numerical value Cols: Range (Min & Max), Central values (Mean), Std.D, Quartiles

Out[569]:
cement slag ash water splast corse fine age mpa
count 1,030.00 1,030.00 1,030.00 1,030.00 1,030.00 1,030.00 1,030.00 1,030.00 1,030.00
mean 281.17 73.90 54.19 181.57 6.20 972.92 773.58 45.66 35.82
std 104.51 86.28 64.00 21.35 5.97 77.75 80.18 63.17 16.71
min 102.00 0.00 0.00 121.80 0.00 801.00 594.00 1.00 2.33
25% 192.38 0.00 0.00 164.90 0.00 932.00 730.95 7.00 23.71
50% 272.90 22.00 0.00 185.00 6.40 968.00 779.50 28.00 34.45
75% 350.00 142.95 118.30 192.00 10.20 1,029.40 824.00 56.00 46.14
max 540.00 359.40 200.10 247.00 32.20 1,145.00 992.60 365.00 82.60
Out[569]:

* DF Stats for Numerical value Cols: Central values: MEDIAN which are not in Standard "Describe" above^:

Out[569]:
cement   272.90
slag      22.00
ash        0.00
water    185.00
splast     6.40
corse    968.00
fine     779.50
age       28.00
mpa       34.45
dtype: float64
Out[569]:

* DF Stats for All Cols: Central values: MODE which are not in Standard "Describe" above^:

Out[569]:
cement slag ash water splast corse fine age mpa
0 362.60 0.00 0.00 192.00 0.00 932.00 594.00 28.00 33.40
1 425.00 nan nan nan nan nan 755.80 nan nan
Out[569]:

* $ \ 0$ = Duplicate Columns Based on ALL Rows

Out[569]:

* $25$ = Duplicate Rows Based on ALL 9 Columns

Out[569]:

* DF Duplicated Rows Based on CERTAIN "Mixture Ingredients" Columns:

  38 = Dup. Rows for first 8 Cols: cement, slag, ash, water, splast, corse, fine, age. (Except: "mpa")
 603 = Dup. Rows for first 7 Cols: cement, slag, ash, water, splast, corse, fine. (Except: "age" & "mpa")
  25 = Dup. Rows for first 7 & "mpa" Cols: cement, slag, ash, water, splast, corse, fine, mpa. (Except: "age")
Out[569]:

* DF Unique Values for All Columns:

Out[569]:
cement    278
slag      185
ash       156
water     195
splast    111
corse     284
fine      302
age        14
mpa       845
dtype: int64
Out[569]:

* DF Null values for All Columns:

Out[569]:
cement    0
slag      0
ash       0
water     0
splast    0
corse     0
fine      0
age       0
mpa       0
dtype: int64
Out[569]:

* DF NON Numeric Values in Numerical Columns:

Out[569]:
cement    0
slag      0
ash       0
water     0
splast    0
corse     0
fine      0
age       0
mpa       0
dtype: int64
Out[569]:

* DF Zero Values in Numerical Columns:

Out[569]:
cement      0
slag      471
ash       566
water       0
splast    379
corse       0
fine        0
age         0
mpa         0
dtype: int64
Out[569]:

* DF Negative (-ve) Values in Numerical Columns:

Out[569]:
cement    0
slag      0
ash       0
water     0
splast    0
corse     0
fine      0
age       0
mpa       0
dtype: int64
Out[569]:

* DF Skewness for All Numerical Columns:

Out[569]:
cement    0.51
slag      0.80
ash       0.54
water     0.07
splast    0.91
corse    -0.04
fine     -0.25
age       3.27
mpa       0.42
dtype: float64
Out[569]:
* Note: Skew Categories (Arbitrary):
If distribution Skewness is between following ranges then Skewness for Column is...:  
* High     : < −1  OR > +1               Asymmetric  Col : age  
* Moderate : −1    &  −0.5 OR +0.5 & +1  Asymmetric  Cols: cement, slag, ash, splast  
* Low      : −0.5  &  +0.5               Asymmetric  Col : mpa  
* V.Low    : −0.25 &  +0.25              Symmetric   Col : fine  
* No Skew  :  0.0  OR near -0.0+         Symmetric   Cols: water, corse  

In [154]:
### My Housekeeping: Jupyter Code File Incremental Backup 2 Sun.Jul.05 9:49pm ^^^
Markdown("### Incremental Jupyter Notebook Code Backup 2")
! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 2.ipynb" 
! ls -l Project*.ipynb
Out[154]:

Incremental Jupyter Notebook Code Backup 2

-rw-r--r--  1 RiddhiSiddhi  staff  6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict.ipynb
In [571]:
# Indetify Outlier Values in All Numerical Columns: 

Markdown('### * DF Number of Outliers for Numerical Columns: $ \ \ Low = Q1 - (IQR * 1.5) $; $ \ \ High = Q3 + (IQR *1.5 $)')
for col in cdf.select_dtypes(include='number'):
    q1   = cdf[col].quantile(.25)
    q3   = cdf[col].quantile(.75)
    otr  = (q3 - q1) * 1.5
    otl = int(q1 - otr)
    oth = int( q3 + otr)
    otls = (cdf[col] < otl).sum()
    oths = (cdf[col] > oth).sum()
    
    print('*', col)
    print(' ', str(otls).rjust(2, ' '), 'outliers Under Low  End', str(otl).rjust(5, ' '))
    print(' ', str(oths).rjust(2, ' '), 'outliers Over  High End', str(oth).rjust(5, ' '))
    print()
Out[571]:

DF Number of Outliers for Numerical Columns: $ \ \ Low = Q1 - (IQR 1.5) $; $ \ \ High = Q3 + (IQR *1.5 $)

* cement
   0 outliers Under Low  End   -44
   0 outliers Over  High End   586

* slag
   0 outliers Under Low  End  -214
   2 outliers Over  High End   357

* ash
   0 outliers Under Low  End  -177
   0 outliers Over  High End   295

* water
   5 outliers Under Low  End   124
   4 outliers Over  High End   232

* splast
   0 outliers Under Low  End   -15
  10 outliers Over  High End    25

* corse
   0 outliers Under Low  End   785
   0 outliers Over  High End  1175

* fine
   0 outliers Under Low  End   591
   5 outliers Over  High End   963

* age
   0 outliers Under Low  End   -66
  59 outliers Over  High End   129

* mpa
   0 outliers Under Low  End    -9
   9 outliers Over  High End    79

In [410]:
# Histogram & Density of Entire Dataset: UnScaled & Scaled (zscore): Visual Distribution of DF values: 

p = plt.figure(figsize=(20,7))

p = plt.subplot(1, 2, 1) 
g = sns.distplot(cdf)

p = plt.subplot(1, 2, 2) 
g = sns.distplot(cdf.corr()) 
In [346]:
# Individual Histogram of ALL 9 Numerical Columns: Visual Distribution of column values: 
# catch plt sns outputs in some variables ("p" & "g") to supress informational output 
p = plt.figure(figsize=(20, 20)) 

pos = 1 
for col in cdf.columns: 
    p = plt.subplot(3, 3, pos) 
    g = sns.distplot(cdf[col]) 
    pos += 1 
D1.1 Comments / Conclusions / Recommendations:
  • Very many Zero values in 3 columns: slag = 471, (46%), ash = 566, (55%), splast = 379, (37%) = Impute / Scale / Drop "ash"? = NO. This are VALID ZERO valuea. Leave them as it is = Per Domain/Academic Support: These can be zeros and are VALID
  • Not many Duplicate Rows: Only 25 (2.43%) = Delete/Drop.
  • Skewness: Scale

    • High : age
    • Moderate : cement, slag, ash, splast
    • Low : mpa
    • V.Low : fine
    • No Skew : water, corse
  • Very few outliers: Impute / Scale.

    • cement, ash, corse: No (zero) Outliers
    • slag (rare): 2 outliers (Over High Limit)
    • water (medium): 9 outliers ( 5 Under Low; 4 Over High Limits)
    • splast (medium): 10 outliers (Over High Limit)
    • fine (rare) : 5 outliers (Over High Limit)
    • age (high) : 59 outliers (Over High Limit)
    • mpa (medium) : 9 outliers (Over High Limit)
D1.1 ^^^Completed & Delivered As Per Above Exhibits: Univariate Analysis Ends here: ^^^
  • Displayed: Data Types, Names, Description, Meaning, Data Profile, Data Example Rows (Head, Tail, Random Sample)
  • Generated: Date Values Range (Min/Max, Value Counts), Central Values (Mean, Median, Mode), Standard Deviation, Quartiles
  • Presented: Visual / Graphical Data Distribution and Tails for Numerical Values
  • Identified: Zero & Negative Values, Missing (Null & Unknown/Unspecified) Values, Outliers
  • Conclusion: Comments & Recommendations provided above^
    ##### Note: Data Name, Type, Meaning, Description has been tabulated in the Header Section at the Top^^^
In [379]:
### My Housekeeping: Jupyter Code File Incremental Backup 3 Mon.Jul.06 7:39am ^^^
Markdown("### Incremental Jupyter Notebook Code Backup 3")
! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 3.ipynb" 
! ls -l Project*.ipynb
Out[379]:

Incremental Jupyter Notebook Code Backup 3

-rw-r--r--  1 RiddhiSiddhi  staff  6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict.ipynb

D1.2: Bi Variate Analysis Starts below: vvv

Analyze Among Predictor Variables and Between Predictor & Target (Predicted) Columns. Comment on your findings in terms of their Relationship and Degree of Relation if any. Visualize the analysis using Boxplots and Pair Plots with Histograms or Density Curves.

In [613]:
Markdown('##### * PairPlot BiVariate Study among Predictor variables & with Predicted/Target column, with Density Curves (diag_kind="kde")')
g = sns.pairplot(cdf, diag_kind='kde')
Out[613]:
* PairPlot BiVariate Study among Predictor variables & with Predicted/Target column, with Density Curves (diag_kind="kde")
In [897]:
Markdown('##### * Individual Box Plots of ALL 9 Numerical Columns: Visuals For Quartiles, Middle & Outlier values:')

p = plt.figure(figsize=(20, 20)) 
pos = 1 
for col in cdf.columns: 
    p = plt.subplot(3, 3, pos) 
    g = sns.boxplot(cdf[col]) 
    pos += 1 
Out[897]:
* Individual Box Plots of ALL 9 Numerical Columns: Visuals For Quartiles, Middle & Outlier values:
In [818]:
# "Age" has 14 (limited) "bin like" non continous numerical values, though it can have continous values from 1 to 365
# Hence we can use BoxPlot & PointPlot to do BiVariate study between predictor "age" col and "mpa" col (predicted/target):

Markdown('### <center> * BoxPlot & PointPlot : BiVariate Study of "age" & "mpa" * </center>')
p = plt.figure(figsize=(20, 20)) 
g = sns.boxplot('age', 'mpa', data=cdf)
g = sns.pointplot(cdf.age, cdf.mpa)
Out[818]:

* BoxPlot & PointPlot : BiVariate Study of "age" & "mpa" *

In [618]:
# Additional Exhibit 1: 

Markdown('### * Correlation Matrix:')
cdf.corr() 

Markdown('### * Correlation HeatMap Matrix:')
p = plt.figure(figsize=(11, 7)) 
g = sns.heatmap(cdf.corr())
Out[618]:

* Correlation Matrix:

Out[618]:
cement slag ash water splast corse fine age mpa
cement 1.00 -0.28 -0.40 -0.08 0.09 -0.11 -0.22 0.08 0.50
slag -0.28 1.00 -0.32 0.11 0.04 -0.28 -0.28 -0.04 0.13
ash -0.40 -0.32 1.00 -0.26 0.38 -0.01 0.08 -0.15 -0.11
water -0.08 0.11 -0.26 1.00 -0.66 -0.18 -0.45 0.28 -0.29
splast 0.09 0.04 0.38 -0.66 1.00 -0.27 0.22 -0.19 0.37
corse -0.11 -0.28 -0.01 -0.18 -0.27 1.00 -0.18 -0.00 -0.16
fine -0.22 -0.28 0.08 -0.45 0.22 -0.18 1.00 -0.16 -0.17
age 0.08 -0.04 -0.15 0.28 -0.19 -0.00 -0.16 1.00 0.33
mpa 0.50 0.13 -0.11 -0.29 0.37 -0.16 -0.17 0.33 1.00
Out[618]:

* Correlation HeatMap Matrix:

In [666]:
# Additional Exhibit 2: 

# Density, Histogram, Scatter plots of All Attributes Interactions: 
# Different Upper & Lower triangles: eg. Scatter & KDE (Density)

Markdown('### * PairGrid BiVariate Study among Predictor variables & with Predicted/Target column')  
Markdown('* <u> Upper Half</u> : LINEAR REGRESSION Fitted line thru Scatter Plot')  
Markdown('* <u> Lower Half</u> : Kernel Densities among Atributes')   

g = sns.PairGrid(cdf)  

g = g.map_upper(sns.regplot)
g = g.map_lower(sns.kdeplot)

g = g.map_diag(plt.hist, lw=2)
plt.show()
Out[666]:

* PairGrid BiVariate Study among Predictor variables & with Predicted/Target column

Out[666]:
  • Upper Half : LINEAR REGRESSION Fitted line thru Scatter Plot
Out[666]:
  • Lower Half : Kernel Densities among Atributes
In [914]:
# Additional Exhibit 3: 

Markdown('##### * BiVariate Study of Predicted/Target "mpa" (y) column with remaining 8 (X) Predictor attributes:')  
Markdown('''* Regression Line Fitted thru Scatter Plot For:  
        1=Linear, 2=Quadratic, 3=Cubic. Using regplot(X, y) for order=1,2,3, Confidence Interval ci = 95 (line shadows)''')  

p = plt.figure(figsize=(20, 20)) 
pos = 1 
for col in cdf.columns: 
    p = plt.subplot(3, 3, pos) 
    g = sns.regplot(col, 'mpa', data=cdf, order=1) 
    g = sns.regplot(col, 'mpa', data=cdf, order=2) 
    g = sns.regplot(col, 'mpa', data=cdf, order=3) 
    p = plt.legend(['Linear','Quadratic','Cubic'], loc="best") 
    pos += 1 
Out[914]:
* BiVariate Study of Predicted/Target "mpa" (y) column with remaining 8 (X) Predictor attributes:
Out[914]:
  • Regression Line Fitted thru Scatter Plot For:
      1=Linear, 2=Quadratic, 3=Cubic. Using regplot(X, y) for order=1,2,3, Confidence Interval ci = 95 (line shadows)
In [820]:
### My Housekeeping: Jupyter Code File Incremental Backup 4 Wed.Jul.08 12:55am ^^^  
Markdown("### Incremental Jupyter Notebook Code Backup 4")  
! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 4.ipynb"  
! ls -l Project*.ipynb  
Out[820]:

Incremental Jupyter Notebook Code Backup 4

-rw-r--r--  1 RiddhiSiddhi  staff   6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict Backup 4.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict.ipynb
D1.2 Comments / Conclusions / Recommendations:
  • Relationship & Degree: Correlations: None are HIGH (-0.80+): Column(s) DROP = None

    • High : None
    • Significant : splast & water = -0.66 (Negative Correlation)
    • High Medium : cement & map = +.50 (Positive Correlation)
    • Low Medium : water & fine = -.45; cement & ash = -.40 (Negative Correlations)
    • Low/Notable : ash & splast = +.38; splast & map = +.37; age & mpa = +.33 (Positive Correlation)
  • Data Dispersion: Multiple Gaussians (Humps) in Distributions: SCALE Data = Yes

    • Minimal : mpa, cement
    • Moderate: water, corse, fine
    • Significant: slag, ash, splast, age
  • Data Dispersion: Regression Estimate: SCALE Data = Yes; FE/MT = Polynomials, Adv.Regr.Models, HyperTune

    • As apparent from the above Regression Plot (regplot, order=1,2,3) between "mpa" (Target) & 8 Predictor cols, that we are UNABLE to fit ANY regression line thru the Data the way its Dispersed (spread out) with even a reasonably accuracy, because the data points are scattered so far out from the regression lines for Linear, Quadratic, Cubic relationships (polynomials).
D1.2 ^^^ Completed & Delivered As Per Above Exhibits: BiVariate Analysis Ends here: ^^^
  • Visialized:

    • PairPlots: Displayed with Density
    • BoxPlots: Displayed for 1. For each Column. 2. With "age" & "mpa" with overlayed PointPlot ("age" & "mpa")
    • Corelation & Heatmap: Displayed
    • PairGrid: Displayed with Histogram and Kernel Densities (lower half, "kdeplot") & Linear Regression (upper half, "regplot")
    • RegPlot: Displayed Regression Plots for orders/degrees: 1=Linear, 2=Quadratic, 3=Cubic: For each Predictor column with "mpa" Predicted/Target column
  • Analyzed: Interation / Relationship among Predictor Variables AND between Predictors & Predicted/Target ("mpa") column. Observations and Findings on Relationship & Degree Relation provided above^

  • Comments: Conclusions & Recommendations provided above^
In [933]:
### My Housekeeping: Jupyter Code File Incremental Backup 5 Wed.Jul.08 11:23pm ^^^  
Markdown("### Incremental Jupyter Notebook Code Backup 5")  
! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 5.ipynb"  
! ls -l Project*.ipynb  
Out[933]:

Incremental Jupyter Notebook Code Backup 5

-rw-r--r--  1 RiddhiSiddhi  staff   6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict Backup 4.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10250327 Jul  8 23:23 Project 4 FEMST Concrete Strength Predict Backup 5.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10250327 Jul  8 23:23 Project 4 FEMST Concrete Strength Predict.ipynb

vvv D1.3: Feature Engineering Techniques: Begins Below vvv :

  • $D1.3.a:$ Identify opportunities (if any) to Extract a New Feature from existing features and/or Drop a Feature (if required).
  • $D1.3.b:$ Get data model ready and do a train test split.
  • $D1.3.c:$ Decide on complexity of the model, should it be Simple Linear Model in terms of Parameters or would it be a Quadratic or Higher degree.
In [945]:
# Drop Duplicate Rows (25) as determined earlier in UniVariate section: 

cdf1.shape  # before
cdf.drop_duplicates(keep='first', inplace=True)
cdf.shape  # after
print('Rows Dropped:', len(cdf1)-len(cdf))
Out[945]:
(1030, 9)
Out[945]:
(1005, 9)
Rows Dropped: 25
In [946]:
# My Housekeeping: Incremental DF Data Backup 2 as of Thu.Jul.09 3:29am : 
cdf2 = cdf.copy()  # Modified Df: Dropped 25 Duplicated Rows 

cdf2.to_csv('cdf2.csv')  # Also export as .csv file to disk 
! ls -l cdf*

# Verify backup copy
cdf2.shape, type(cdf2)
cdf2.sample(5)
-rw-r--r--  1 RiddhiSiddhi  staff  52485 Jun 30 01:52 cdf0.csv
-rw-r--r--  1 RiddhiSiddhi  staff  52467 Jun 30 02:27 cdf1.csv
-rw-r--r--  1 RiddhiSiddhi  staff  51200 Jul  9 03:29 cdf2.csv
Out[946]:
((1005, 9), pandas.core.frame.DataFrame)
Out[946]:
cement slag ash water splast corse fine age mpa
753 446.00 24.00 79.00 162.00 11.60 967.00 712.00 7 38.02
187 252.00 0.00 0.00 185.00 0.00 1,111.00 784.00 7 13.71
839 337.90 189.00 0.00 174.90 9.50 944.70 755.80 7 35.10
481 288.40 121.00 0.00 177.40 7.00 907.90 829.50 28 42.14
133 236.00 0.00 0.00 193.00 0.00 968.00 885.00 365 25.08
Outlers Imputation Strategies Evaluated for 99 Total outliers in the DF:
  • Methods NOT Used: Because it produced More / New Outliers after Imputations. (Tried Separately, code clutter removed):
    • Deleting Rows with Outliers Imputing Outliers with MEDIAN values Imputing Outliers with -(IQR * 1.5)+ values
  • Method USED: * Imputing Outliers with Q1 / Q3 values: : This method eliminating ALL existing outliers without creating more / new ones.
In [1063]:
# Impute All Outliers based on IQR Method with respective columns' Q1 or Q3 values: 

Markdown('### * DF Outliers Imputations with Q1 (Under Low End) or Q3 (Over High End) Values for All Numerical Columns: $ \ \ Low = Q1 - (IQR * 1.5) $; $ \ \ High = Q3 + (IQR *1.5 $)')

for col in cdf.select_dtypes(include='number'):

    q1   = cdf[col].quantile(.25)
    q3   = cdf[col].quantile(.75)

    otr  = (q3 - q1) * 1.5
    otl = round(q1 - otr,2)
    oth = round(q3 + otr,2)

    otls = (cdf[col] < otl).sum()
    oths = (cdf[col] > oth).sum()
        
    cdf[col] = np.where((cdf[col] < otl), q1, cdf[col])
    cdf[col] = np.where((cdf[col] > oth), q3, cdf[col])
    
    print('*', col, ':', otls+oths, 'values imputed with Q1, Q3 value:', otl, oth) 
    print(' ', str(otls).rjust(2, ' '), 'outliers Under Low  End', str(otl).rjust(5, ' ')) 
    print(' ', str(oths).rjust(2, ' '), 'outliers Over  High End', str(oth).rjust(5, ' ')) 
    print() 
Out[1063]:

DF Outliers Imputations with Q1 (Under Low End) or Q3 (Over High End) Values for All Numerical Columns: $ \ \ Low = Q1 - (IQR 1.5) $; $ \ \ High = Q3 + (IQR *1.5 $)

* cement : 0 values imputed with Q1, Q3 value: -46.75 586.45
   0 outliers Under Low  End -46.75
   0 outliers Over  High End 586.45

* slag : 2 values imputed with Q1, Q3 value: -213.75 356.25
   0 outliers Under Low  End -213.75
   2 outliers Over  High End 356.25

* ash : 0 values imputed with Q1, Q3 value: -177.45 295.75
   0 outliers Under Low  End -177.45
   0 outliers Over  High End 295.75

* water : 15 values imputed with Q1, Q3 value: 127.15 232.35
  11 outliers Under Low  End 127.15
   4 outliers Over  High End 232.35

* splast : 10 values imputed with Q1, Q3 value: -15.0 25.0
   0 outliers Under Low  End -15.0
  10 outliers Over  High End  25.0

* corse : 0 values imputed with Q1, Q3 value: 783.5 1179.5
   0 outliers Under Low  End 783.5
   0 outliers Over  High End 1179.5

* fine : 5 values imputed with Q1, Q3 value: 577.45 969.05
   0 outliers Under Low  End 577.45
   5 outliers Over  High End 969.05

* age : 59 values imputed with Q1, Q3 value: -66.5 129.5
   0 outliers Under Low  End -66.5
  59 outliers Over  High End 129.5

* mpa : 8 values imputed with Q1, Q3 value: -8.5 76.9
   0 outliers Under Low  End  -8.5
   8 outliers Over  High End  76.9

In [1076]:
# Post Imputation: Indetify New Outlier Values in All Numerical Columns: 

Markdown('### * DF New Outliers After Imputations for Numerical Columns: NO OUTLIERS: \n $ \ \ Low = Q1 - (IQR * 1.5) $; $ \ \ High = Q3 + (IQR *1.5 $)')
for col in cdf.select_dtypes(include='number'):
    q1   = cdf[col].quantile(.25)
    q3   = cdf[col].quantile(.75)
    otr  = (q3 - q1) * 1.5
    
    otl = round(q1 - otr,2)
    oth = round(q3 + otr,2)

    otls = (cdf[col] < otl).sum()
    oths = (cdf[col] > oth).sum()
    
    print('*', col)
    print(' ', str(otls).rjust(2, ' '), 'outliers Under Low  End', str(otl).rjust(5, ' '))
    print(' ', str(oths).rjust(2, ' '), 'outliers Over  High End', str(oth).rjust(5, ' '))
    print()
Out[1076]:

* DF New Outliers After Imputations for Numerical Columns: NO OUTLIERS:

$ \ \ Low = Q1 - (IQR * 1.5) $; $ \ \ High = Q3 + (IQR *1.5 $)

* cement
   0 outliers Under Low  End -46.75
   0 outliers Over  High End 586.45

* slag
   0 outliers Under Low  End -213.75
   0 outliers Over  High End 356.25

* ash
   0 outliers Under Low  End -177.45
   0 outliers Over  High End 295.75

* water
   0 outliers Under Low  End 127.15
   0 outliers Over  High End 232.35

* splast
   0 outliers Under Low  End -15.0
   0 outliers Over  High End  25.0

* corse
   0 outliers Under Low  End 783.5
   0 outliers Over  High End 1179.5

* fine
   0 outliers Under Low  End 577.45
   0 outliers Over  High End 969.05

* age
   0 outliers Under Low  End -66.5
   0 outliers Over  High End 129.5

* mpa
   0 outliers Under Low  End  -8.5
   0 outliers Over  High End  76.9

In [1150]:
# Verify Outliers with BoxPlot: 
Markdown('##### * Verify Outliers with BoxPlot: NO OUTLIERS:')
p = plt.figure(figsize=(20, 7)) 
g = cdf.boxplot()
Out[1150]:
* Verify Outliers with BoxPlot: NO OUTLIERS:
In [1081]:
# My Housekeeping: Incremental DF Data Backup 3 as of Thu.Jul.09 8:01am : 
cdf3 = cdf.copy()  # Modified Df: Imputed 99 Outlier values to Q1 and/or Q3 column values respectively as applicable.

cdf3.to_csv('cdf3.csv')  # Also export as .csv file to disk 
! ls -l cdf*

# Verify backup copy
cdf3.shape, type(cdf3)
cdf3.sample(5)
-rw-r--r--  1 RiddhiSiddhi  staff  52485 Jun 30 01:52 cdf0.csv
-rw-r--r--  1 RiddhiSiddhi  staff  52467 Jun 30 02:27 cdf1.csv
-rw-r--r--  1 RiddhiSiddhi  staff  51200 Jul  9 03:29 cdf2.csv
-rw-r--r--  1 RiddhiSiddhi  staff  53157 Jul  9 08:01 cdf3.csv
Out[1081]:
((1005, 9), pandas.core.frame.DataFrame)
Out[1081]:
cement slag ash water splast corse fine age mpa
771 168.90 42.20 124.30 158.30 10.80 1,080.80 796.20 28.00 31.12
385 305.30 203.50 0.00 203.50 0.00 965.40 631.00 28.00 43.38
301 277.10 0.00 97.40 160.60 11.80 973.90 875.60 56.00 51.04
835 222.40 0.00 96.70 189.30 4.50 967.10 870.30 28.00 24.89
842 332.50 142.50 0.00 228.00 0.00 932.00 594.00 7.00 30.28
In [1082]:
### My Housekeeping: Jupyter Code File Incremental Backup 6 Thu.Jul.09 08:04am ^^^  
Markdown("### Incremental Jupyter Notebook Code Backup 6")  
! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 6.ipynb"  
! ls -l Project*.ipynb  
Out[1082]:

Incremental Jupyter Notebook Code Backup 6

-rw-r--r--  1 RiddhiSiddhi  staff   6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict Backup 4.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10250327 Jul  8 23:23 Project 4 FEMST Concrete Strength Predict Backup 5.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10115330 Jul  9 08:04 Project 4 FEMST Concrete Strength Predict Backup 6.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10115330 Jul  9 08:04 Project 4 FEMST Concrete Strength Predict.ipynb
In [1326]:
cdf1
Out[1326]:
cement slag ash water splast corse fine age mpa
0 141.30 212.00 0.00 203.50 0.00 971.80 748.50 28 29.89
1 168.90 42.20 124.30 158.30 10.80 1,080.80 796.20 14 23.51
2 250.00 0.00 95.70 187.40 5.50 956.90 861.20 28 29.22
3 266.00 114.00 0.00 228.00 0.00 932.00 670.00 28 45.85
4 154.80 183.40 0.00 193.30 9.10 1,047.40 696.70 28 18.29
... ... ... ... ... ... ... ... ... ...
1025 135.00 0.00 166.00 180.00 10.00 961.00 805.00 28 13.29
1026 531.30 0.00 0.00 141.80 28.20 852.10 893.70 3 41.30
1027 276.40 116.00 90.30 179.60 8.90 870.10 768.30 28 44.28
1028 342.00 38.00 0.00 228.00 0.00 932.00 670.00 270 55.06
1029 540.00 0.00 0.00 173.00 0.00 1,125.00 613.00 7 52.61

1030 rows × 9 columns

In [1499]:
# Prepare data for split: Create X, y (Predictor, Predicted) datasets: 

# X = cdf1.copy()  # Contains: 25 duplicates; 99 Outliers, i.e. Before removing dups & outliers
X = cdf.copy() # Dups & Outliers Removed
y = X.pop('mpa')

# Split df data into 3 datasets: Train, Test, Validate: 
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=3)
X_trn, X_val, y_trn, y_val = train_test_split(X_trn, y_trn, test_size=0.25, random_state=3)
In [1500]:
X.shape, y.shape, X_trn.shape, X_tst.shape, X_val.shape, y_trn.shape, y_tst.shape, y_val.shape
type(X_trn), type(X_tst), type(X_val), type(y_trn), type(y_tst), type(y_val)
Out[1500]:
((1005, 8), (1005,), (603, 8), (201, 8), (201, 8), (603,), (201,), (201,))
Out[1500]:
(pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.series.Series,
 pandas.core.series.Series,
 pandas.core.series.Series)
In [1501]:
lrm = LinearRegression()
lrm.fit(X_trn, y_trn)
Out[1501]:
LinearRegression()
In [1502]:
print('\n* UnScaled Data: Coefficients:', lrm.coef_)
print('\n* UnScaled Data: Intercept:', lrm.intercept_)
Markdown('###### * UnScaled Data: $R^2$ Score for: In Sample & Out Of Sample (Validaton dataset):') 
print(' * In Sample R^2:', lrm.score(X_trn, y_trn), 'Out Of Sample R^2:', lrm.score(X_val, y_val))
print(' * Root Mean Square Error RMSE:', mean_squared_error(y_val, lrm.predict(X_val))**0.5)
# print(lrm.summary())
* UnScaled Data: Coefficients: [ 0.08734746  0.05456783  0.03350266 -0.24556834  0.14132574 -0.02387326
 -0.02728903  0.30367803]

* UnScaled Data: Intercept: 82.87090539868043
Out[1502]:
* UnScaled Data: $R^2$ Score for: In Sample & Out Of Sample (Validaton dataset):
 * In Sample R^2: 0.7330438038819633 Out Of Sample R^2: 0.680319427895178
 * Root Mean Square Error RMSE: 8.898078241713177
In [1503]:
# print('Predict y on Validation dataset:', lrm.predict(X_val).reshape(-1,))
In [1504]:
std_sclr = StandardScaler()

X_trn = std_sclr.fit_transform(X_trn)
X_val = std_sclr.fit_transform(X_val)

X_trn = pd.DataFrame(X_trn)
X_val = pd.DataFrame(X_val)

y_trn = pd.DataFrame(y_trn)
y_val = pd.DataFrame(y_val)

y_trn = std_sclr.fit_transform(y_trn)
y_val = std_sclr.fit_transform(y_val)

X_trn.columns = X.columns
X_val.columns = X.columns

lrm_s = LinearRegression()
lrm_s.fit(X_trn, y_trn)

# X_train = scaler.fit_transform(X_train)
# X_vaid = scaler.transform(X_valid)

# X_test = scaler.transform(X_test)

# model.fit(X_train,y_train)
# y_pred_valid = model.predict(X_valid).reshape(-1,)  # array.reshape(-1, 1) 
Out[1504]:
LinearRegression()
In [1506]:
print('\n* SCALED Data: Coefficients:', lrm_s.coef_)
print('\n* SCALED Data: Intercept:', lrm_s.intercept_)
Markdown('###### * SCALED Data: $R^2$ Score for In Sample, Out Of Sample (Validaton dataset: SCALED Data):') 
print(' * In Sample R^2:', lrm_s.score(X_trn, y_trn), 'Out Of Sample R^2:', lrm_s.score(X_val, y_val))
print(' * Root Mean Square Error RMSE:', mean_squared_error(y_val, lrm_s.predict(X_val))**0.5)
# print(lrm_s.summary())
* SCALED Data: Coefficients: [[ 0.58985309  0.300613    0.13890307 -0.31198912  0.04802968 -0.12048633
  -0.14206641  0.5491025 ]]

* SCALED Data: Intercept: [-7.99148581e-16]
Out[1506]:
* SCALED Data: $R^2$ Score for In Sample, Out Of Sample (Validaton dataset: SCALED Data):
 * In Sample R^2: 0.7330438038819633 Out Of Sample R^2: 0.682084887709276
 * Root Mean Square Error RMSE: 0.5638396157514333
In [1507]:
# print('Predict y on Validation dataset:', lrm.predict(X_val).reshape(-1,))
# print('Predict y on Train dataset:', lrm.predict(X_trn).reshape(-1,))
In [1519]:
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_trn_p2 = poly.fit_transform(X_trn)
X_val_p2 = poly.fit_transform(X_val)

poly2m = LinearRegression()

poly2m.fit(X_trn_p2, y_trn)

y_pred_poly2 = poly2m.predict(X_val_p2)

# print(y_pred_poly2)

print('* Data SCALED & CLEANED (No Dups, No Outliers):')
print('* R^2 Score: In Sample:', poly2m.score(X_trn_p2, y_trn), 'R^2 Score: Out Of Sample:', poly2m.score(X_val_p2, y_val))
print('* Root Mean Square Error RMSE:', mean_squared_error(y_val, poly2m.predict(X_val_p2))**0.5)
print('* Additional Features/Columns (Polynomials) Created:', X_trn_p2.shape[1] - X_trn.shape[1])
Out[1519]:
LinearRegression()
* Data SCALED & CLEANED (No Dups, No Outliers):
* R^2 Score: In Sample: 0.7922581527934405 R^2 Score: Out Of Sample: 0.7234816793366159
* Root Mean Square Error RMSE: 0.5258500933378105
* Additional Features/Columns (Polynomials) Created: 29
In [1520]:
poly = PolynomialFeatures(degree=3, interaction_only=True)
X_trn_p3 = poly.fit_transform(X_trn)
X_val_p3 = poly.fit_transform(X_val)

poly3m = LinearRegression()

poly3m.fit(X_trn_p3, y_trn)

y_pred_poly3 = poly3m.predict(X_val_p3)

# print(y_pred)

print('* Data SCALED & CLEANED (No Dups, No Outliers):')
print('* R^2 Score: In Sample:', poly3m.score(X_trn_p3, y_trn), 'R^2 Score: Out Of Sample:', poly3m.score(X_val_p3, y_val))
print('* Root Mean Square Error RMSE:', mean_squared_error(y_val, poly3m.predict(X_val_p3))**0.5)
print('* Additional Features/Columns (Polynomials) Created:', X_trn_p3.shape[1] - X_trn.shape[1])
Out[1520]:
LinearRegression()
* Data SCALED & CLEANED (No Dups, No Outliers):
* R^2 Score: In Sample: 0.8414541656579173 R^2 Score: Out Of Sample: 0.6823193595550907
* Root Mean Square Error RMSE: 0.563631653160918
* Additional Features/Columns (Polynomials) Created: 85
D1.3 ^^^ Completed & Delivered As Per Above Exhibits: Feature Engineering Techniques: Ends Here: ^^^ :
  • $D1.3.a:$ DONE: Extract a New Feature and/or Drop a Feature (if required):
    • NO New Feature(s) Extracts & NO Feature(s) Drop: As per the above Obervations and Evaluations it is not Recommended to Extract New Features or Drop any Existing Features. We can get High Performace on the Higher side of the Expected Range (As per Delivrable D2.x in the next section below).
  • $D1.3.b:$ DONE: Get data model ready and do a train test split
    • Created THREE Datasets: TRAIN (X_trn, y_trn), Validation (X_val, y_val), TEST (X_tst, y_tst): To prevent DATA LEAKS, all training & testing to be done on _trn and _val datasets. TEST datasets (_tst) to be RESERVED and kept aside, UNSEEN from any modelling, transformation algorithms activities until the very end when we decide on Final Best Model. At that time we will use the TEST dataset to test and evaluate the FINAL BEST MODEL.
  • $D1.3.c:$ DONE: Decide on complexity of the model: Simple Linear, Quadratic or Higher degree
    • Performed Polynomial operations for degree/order = 1,2,3 (Linear, Quadratic, Cubic).
    • Drastic GAIN in $R^2$ and Drastic Reduction in RMSE to 0.56.
    • The max $R^2$ achieved for degree=3 (Cubic) was 84% (in Sample) and 68% (Out of Sample) which is below Expectation.
    • Cost: 29 ADDITIONAL Features for Quadratic and 85 ADDITIONAL Features for Cubic. This "Curse Of Dimensionality" is not worth the Performance obtained.
    • HENCE: Use SIMPLE LINEAR REGRESSION MODEL: Because we can get BETTER results without Additional / Engineered Features or Dropping Existing Features using other Regression alogrithms like Decision Tree, Random Forest, Boost Algorithms with x_SearchCV otpions (As per Delivrable D2.x in the next section below).

*D2: Creating the Model and Tuning it:*

D2.1: vvv Algorithms: Begins Here: vvv

Use the Algorithms that you think will be suitable for this project (at least 3 algorithms). Use ***Kfold Cross Validation*** to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics.

Dataset PreProcessing Choices: As per foregoing observations based on analytics and results it appear that:
  • NO Significant Performance Improvement is achieved with:
    • Scalling & Polynomials.
    • Presence or Absence of of 25 (2.43%) Duplicated Rows in the dataset of 1030 rows.
    • Presence or Absence of of 99 (1.07%) Outliers in a dataset (df) of 1030 rows x 9 cols = 9,270 elements.
  • HENCE: To keep the soultion Simpler & Smaller and YET achiving the expected HIGH Performance, we will REVERT back and USE the ORIGINAL dataset having the above mentioned duplicates & outliers.
In [1523]:
# As we just concluded above, we will revert back to the Orifinal df WITHOUT any preprocessing applied for dups, outliers.
# Also we will split dfs again: 

# Prepare data for split: Create X, y (Predictor, Predicted) datasets times 3 for Train, Validation, Test

# X = cdf.copy() # Dups & Outliers Removed
X = cdf1.copy()  # Contains: 25 duplicates; 99 Outliers, i.e. Before removing Dups & Outliers
y = X.pop('mpa')

# Split df data into 3 datasets: Train, Test, Validate: 
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=3)
X_trn, X_val, y_trn, y_val = train_test_split(X_trn, y_trn, test_size=0.25, random_state=3)

X.shape, y.shape, X_trn.shape, X_tst.shape, X_val.shape, y_trn.shape, y_tst.shape, y_val.shape
type(X_trn), type(X_tst), type(X_val), type(y_trn), type(y_tst), type(y_val)
Out[1523]:
((1030, 8), (1030,), (618, 8), (206, 8), (206, 8), (618,), (206,), (206,))
Out[1523]:
(pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.series.Series,
 pandas.core.series.Series,
 pandas.core.series.Series)
In [1524]:
# Trying Decision Tree: 
dTree = DecisionTreeRegressor(random_state=6)
dTree.fit(X_trn, y_trn)

print(dTree.score(X_trn, y_trn))
print(dTree.score(X_val, y_val))
Out[1524]:
DecisionTreeRegressor(random_state=6)
0.9999571857586979
0.8072407565692836
In [1525]:
# Trying Decision Tree: PRUNED: 

dTreeR = DecisionTreeRegressor(max_depth=9, min_samples_leaf=3, random_state=6)
dTreeR.fit(X_trn, y_trn)

print(dTreeR.score(X_trn, y_trn))
print(dTreeR.score(X_val, y_val))
Out[1525]:
DecisionTreeRegressor(max_depth=9, min_samples_leaf=3, random_state=6)
0.943778925044059
0.7649833982162988
In [1526]:
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp DTree UnPruned"], index = X_trn.columns))
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp DTree Pruned"], index = X_trn.columns))
        Imp DTree UnPruned
cement                0.34
slag                  0.08
ash                   0.01
water                 0.08
splast                0.06
corse                 0.03
fine                  0.06
age                   0.35
        Imp DTree Pruned
cement              0.34
slag                0.07
ash                 0.01
water               0.07
splast              0.06
corse               0.02
fine                0.06
age                 0.36
In [1529]:
# Random Forest Classifier Learning Model:
rfcl = RandomForestRegressor(n_estimators = 50, random_state=6, max_features=8)
rfcl = rfcl.fit(X_trn, y_trn)

# y_predict = rfcl.predict(X_test)
# y_predict_rf = y_predict
# print(rfcl.score(X_test, y_test))

print(rfcl.score(X_trn, y_trn))
print(rfcl.score(X_val, y_val))
0.9865892384207504
0.8997314457247279
In [1530]:
# Build AdaBooster Learning Model:
abcl = AdaBoostRegressor(n_estimators=90, random_state=6, learning_rate=0.6)
abcl = abcl.fit(X_trn, y_trn)

# y_predict = abcl.predict(X_test)
# y_predict_ab = y_predict
# print(abcl.score(X_test , y_test))

print(abcl.score(X_trn, y_trn))
print(abcl.score(X_val, y_val))
0.8267219994089168
0.8000375936309319
In [1531]:
# Gradient Boost Learning Model:
gbcl = GradientBoostingRegressor(n_estimators = 40, random_state=6, learning_rate=0.3)
gbcl = gbcl.fit(X_trn, y_trn)

# y_predict = gbcl.predict(X_test)
# y_predict_gb = y_predict
# print(gbcl.score(X_test, y_test))

print(gbcl.score(X_trn, y_trn))
print(gbcl.score(X_val, y_val))
0.9639792909595184
0.90229938952186
In [1532]:
# Build Bagging Learning Model:
bgcl = BaggingRegressor(base_estimator=dTree, n_estimators=12, random_state=6)
bgcl = bgcl.fit(X_trn, y_trn)

# y_predict = bgcl.predict(X_test)
# y_predict_bg = y_predict
# print(bgcl.score(X_test , y_test))

print(bgcl.score(X_trn, y_trn))
print(bgcl.score(X_val, y_val))
0.9837672362553681
0.8981940858966639
Perform KFold CV operations on the BEST Performing Algorithms: 1. GradientBoostingRegressor & 2. RandomForestRegressor as below:
In [1533]:
kfold = KFold(n_splits=50, random_state=6)
model = GradientBoostingRegressor() 
results = cross_val_score(model, X, y, cv=kfold) # , scoring='r2')
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.71479255 0.94974578 0.82956017 0.88314904 0.92740687 0.88543303
 0.88404223 0.84816082 0.92980822 0.84583363 0.92372438 0.96296529
 0.94924543 0.89265465 0.9260654  0.72133464 0.86520491 0.95938137
 0.89883094 0.91426367 0.94235616 0.79924526 0.9566555  0.74903884
 0.83628897 0.91487578 0.90747936 0.90538931 0.9463377  0.87970838
 0.91895899 0.96361487 0.93371566 0.90791054 0.89614271 0.94100964
 0.89785204 0.87923102 0.90875094 0.82815615 0.92463791 0.89599374
 0.95291419 0.92648235 0.89271041 0.86194352 0.92314234 0.89339438
 0.91957533 0.88654067]
Accuracy: 89.403% (5.606%)
In [1534]:
kfold = KFold(n_splits=50, random_state=6)
model = RandomForestRegressor() 
results = cross_val_score(model, X, y, cv=kfold) # , scoring='r2')
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.80389107 0.93949005 0.74311977 0.83958919 0.93321548 0.81129603
 0.92068416 0.94116789 0.96308107 0.79416979 0.93174678 0.97276901
 0.9634126  0.96304661 0.94228044 0.81989615 0.88582648 0.97375469
 0.93102057 0.95207913 0.92442369 0.88625798 0.96779646 0.72975201
 0.89267977 0.9464176  0.94371566 0.91367058 0.96929111 0.90709284
 0.8956873  0.95412055 0.93619966 0.95565484 0.87203474 0.94308593
 0.88887171 0.90825156 0.89346433 0.89473526 0.91975935 0.91632973
 0.96570536 0.9513681  0.90515158 0.91325385 0.95483466 0.97613621
 0.92189027 0.94111998]
Accuracy: 91.229% (5.695%)

$$ * \ Results: Metrics \ Comparision \ * $$

Regression Algorithm Other Metrics R^2 Train DataSet R^2 Validation Dataset Comments
Simple Linear RMSE: 8.898078241713177 0.7330438038819633 0.680319427895178 Data: UnScaled, Cleaned
Simple Linear RMSE: 0.5638396157514333 0.7330438038819633 0.682084887709276 Data: Scaled, Cleaned
Polynomial LinRegr: Degree=2 RMSE: 0.5258500933378105 0.7922581527934405 0.7234816793366159 29 New Features Added. Data: Scaled, Cleaned
Polynomial LinRegr: Degree=3 RMSE: 0.563631653160918 0.8414541656579173 0.6823193595550907 85 New Features Added. Data: Scaled, Cleaned
Decision Tree, UnPrunned 0.9999571857586979 0.8072407565692836 Original Data: UnScaled, Not PreProcessed: Contains 25 Duplicate Rows; 99 Outliers
Decision Tree, Prunned: Depth=3 0.943778925044059 0.7649833982162988 ^^^ Ditto ^^^
Random Forest 0.9865892384207504 0.8997314457247279 ^^^ Ditto ^^^
AdaBoost 0.8267219994089168 0.8000375936309319 ^^^ Ditto ^^^
Gradient Boost 0.9639792909595184 0.90229938952186 ^^^ Ditto ^^^
Bagging 0.9837672362553681 0.8981940858966639 ^^^ Ditto ^^^
KFold for GradientBoostingRegressor Accuracy: 89.403% (5.606% Std.Div.) ^^^ Ditto ^^^
KFold for RandomForestRegressor Accuracy: 91.229% (5.695% Std.Div.) ^^^ Ditto ^^^
D2.1 ^^^ Completed & Delivered As Per Above Exhibits: Algorithms: Ends Here: ^^^ :
  • DONE: Regression Algorithms Used (At Least 3):
    • Simple Linear Rgression
    • Decesion Tree
    • Random Forest
    • Ada Boost
    • Gradient Boost
    • Bagging
  • DONE: KFold CV Used for these Best Performing Algorithms: 1. GradientBoostingRegressor 2. RandomForestRegressor
  • DONE: Metrics Dataframe/Tabulation: Created & Presented above ^^^
In [1536]:
### My Housekeeping: Jupyter Code File Incremental Backup 8 Sat.Jul.11 12:25am ^^^  
Markdown("### Incremental Jupyter Notebook Code Backup 8")  
# ! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 8.ipynb"  
! ls -l Project*.ipynb  
Out[1536]:

Incremental Jupyter Notebook Code Backup 8

-rw-r--r--  1 RiddhiSiddhi  staff   6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict Backup 4.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10250327 Jul  8 23:23 Project 4 FEMST Concrete Strength Predict Backup 5.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10115330 Jul  9 08:04 Project 4 FEMST Concrete Strength Predict Backup 6.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10298475 Jul 10 19:59 Project 4 FEMST Concrete Strength Predict Backup 7.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10303277 Jul 11 00:25 Project 4 FEMST Concrete Strength Predict Backup 8.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10303277 Jul 11 00:25 Project 4 FEMST Concrete Strength Predict.ipynb

vvv * D2.2: Tuning Techniques: Begins Here: vvv

Employ Techniques to squeeze that extra performance out of the model without making it over fit. Use ***Grid Search or Random Search*** on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above.

In [1420]:
# Instantiate a model for RandomForestRegressor:

model_rgr = RandomForestRegressor(n_estimators=50)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 7),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["mse", "mae"]}

# run randomized search
# n_iter = number of random samples 
randomCV = RandomizedSearchCV(model_rgr, param_distributions=param_dist, n_iter=10) #default cv = 3 
In [1421]:
# Fit / Run RandomizedSearchCV for RandomForestRegressor :

randomCV.fit(X, y) 

print(randomCV.best_params_)
len(randomCV.cv_results_['mean_test_score'])  # MSB: These many model fits (runs)
randomCV.cv_results_['mean_test_score']  # :MSB 
Out[1421]:
RandomizedSearchCV(estimator=RandomForestRegressor(n_estimators=50),
                   param_distributions={'bootstrap': [True, False],
                                        'criterion': ['mse', 'mae'],
                                        'max_depth': [3, None],
                                        'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a4c2d2110>,
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a4c2d2450>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a4e438310>})
{'bootstrap': False, 'criterion': 'mae', 'max_depth': None, 'max_features': 4, 'min_samples_leaf': 2, 'min_samples_split': 9}
Out[1421]:
10
Out[1421]:
array([0.62623278, 0.58242664, 0.64311676, 0.88580563, 0.73908991,
       0.88215614, 0.90076659, 0.65429043, 0.66416735, 0.60104706])
In [1436]:
# Instantiate a model for GradientBoostingRegressor:

model_gbr = GradientBoostingRegressor()

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 7),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "learning_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 
              "criterion": ["mse", "mae", 'friedman_mse']}

# run randomized search
# n_iter = number of random samples 
randomCV = RandomizedSearchCV(model_gbr, param_distributions=param_dist, n_iter=10) #default cv = 3 
In [1437]:
# Fit / Run RandomizedSearchCV for GradientBoostingRegressor : 

! date
randomCV.fit(X, y) 
! date

print(randomCV.best_params_)
len(randomCV.cv_results_['mean_test_score'])  # MSB: These many model fits (runs)
randomCV.cv_results_['mean_test_score']  # :MSB 
Fri Jul 10 09:31:47 EDT 2020
Out[1437]:
RandomizedSearchCV(estimator=GradientBoostingRegressor(),
                   param_distributions={'criterion': ['mse', 'mae',
                                                      'friedman_mse'],
                                        'learning_rate': [0.1, 0.2, 0.3, 0.4,
                                                          0.5, 0.6, 0.7, 0.8,
                                                          0.9, 1.0],
                                        'max_depth': [3, None],
                                        'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a675e3a50>,
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a675e3a90>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a675e3c90>})
Fri Jul 10 09:32:28 EDT 2020
{'criterion': 'mse', 'learning_rate': 0.6, 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 10, 'min_samples_split': 2}
Out[1437]:
10
Out[1437]:
array([0.91229875, 0.89962759, 0.91074717, 0.89743273, 0.8906354 ,
       0.90189426, 0.89706885, 0.85087284, 0.88254645, 0.89691306])
In [1441]:
# Based on RandomSearchCV above^ the following is BEST Model / Algorithm with these params: 
# Build the FINAL Model & validate it on the UNSEEN / RESERVED TEST dataset: 

# Gradient Boost Learning Model:
gbcl = GradientBoostingRegressor(criterion='mse', learning_rate=0.6, max_depth=3, random_state=3, 
                                  max_features=6, min_samples_leaf=10, min_samples_split=2)
gbcl = gbcl.fit(X_trn, y_trn)

# y_predict = gbcl.predict(X_test)
# y_predict_gb = y_predict
# print(gbcl.score(X_test, y_test))

print(gbcl.score(X_trn, y_trn))
print(gbcl.score(X_val, y_val))
0.9855879599041985
0.8935105358275034
In [1443]:
# *** FINAL SCORE ON the UNSEEN TEST DATASET: ***  
print(gbcl.score(X_tst, y_tst))
0.9161131402242251
In [1442]:
# Perform KFold on this FINAL BEST MODEL to check the Accuracy: 

kfold = KFold(n_splits=50, random_state=7)
model = GradientBoostingRegressor(criterion='mse', learning_rate=0.6, max_depth=3, random_state=3, 
                                  max_features=6, min_samples_leaf=10, min_samples_split=2) 
results = cross_val_score(model, X_trn, y_trn, cv=kfold) # , scoring='r2')
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.87487065 0.95957705 0.94194542 0.78945003 0.96906788 0.9510934
 0.94492606 0.82635881 0.91334211 0.93334992 0.82722828 0.76615126
 0.90684235 0.94722727 0.93144851 0.94907077 0.93443765 0.88121848
 0.9132982  0.92204492 0.94044749 0.82091912 0.92409265 0.92333839
 0.92534553 0.93417099 0.83029206 0.94369207 0.91268167 0.78430063
 0.94925291 0.89609384 0.82289123 0.95882898 0.94969872 0.60793289
 0.98550107 0.96814283 0.93826759 0.90425085 0.85586585 0.94789081
 0.8571517  0.8886891  0.88797866 0.98493645 0.77745116 0.80695179
 0.92501589 0.97793602]
Accuracy: 89.826% (7.118%)
D2.2 ^^^ Completed & Delivered As Per Above Exhibits: Tuning Techniques: Ends Here: ^^^ :
  • Used RandomizedSearchCV for the Two Best Performing Algorithms for the Best Parameter Search:
    • RandomForestRegressor: Average Score: 85%
    • GradientBoostingRegressor: Average Score: 90%
  • BEST Algorithm based on Search Parameters:
    • GradientBoostingRegressor(criterion='mse', learning_rate=0.6, max_depth=3, random_state=3, max_features=6, min_samples_leaf=10, min_samples_split=2)
    • Test Data: SCORE: 0.9161131402242251 ***(91.61%)***: On the UNSEEN / Reserved Test Data : BEST Final SCORE on high end of Goal Range: 80% to 95%
    • KFoldCV Accuracy: 89.826% (7.118% Std.Div) On GradientBoostingRegressor with the Finalized PARAMETERS as above^

D2 Complete

End Of Project Deliverables

In [1449]:
### My Housekeeping: Jupyter Code File Incremental Backup 7 Fri.Jul.10 07:59pm ^^^  
Markdown("### Incremental Jupyter Notebook Code Backup 7")  
# ! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 7.ipynb"  
! ls -l Project*.ipynb  
Out[1449]:

Incremental Jupyter Notebook Code Backup 7

-rw-r--r--  1 RiddhiSiddhi  staff   6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict Backup 4.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10250327 Jul  8 23:23 Project 4 FEMST Concrete Strength Predict Backup 5.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10115330 Jul  9 08:04 Project 4 FEMST Concrete Strength Predict Backup 6.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10298475 Jul 10 19:59 Project 4 FEMST Concrete Strength Predict Backup 7.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10299866 Jul 10 20:01 Project 4 FEMST Concrete Strength Predict.ipynb
In [1538]:
### My Housekeeping: Jupyter Code File Incremental Backup 9 Sat.Jul.11 01:15am ^^^  
Markdown("### Incremental Jupyter Notebook Code Backup 9")  
! cp "Project 4 FEMST Concrete Strength Predict.ipynb" "Project 4 FEMST Concrete Strength Predict Backup 9.ipynb"  
! ls -l Project*.ipynb  
Out[1538]:

Incremental Jupyter Notebook Code Backup 9

-rw-r--r--  1 RiddhiSiddhi  staff   6736637 Jun 30 02:47 Project 4 FEMST Concrete Strength Predict Backup 1.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6745030 Jul  5 21:49 Project 4 FEMST Concrete Strength Predict Backup 2.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff   6925242 Jul  6 07:39 Project 4 FEMST Concrete Strength Predict Backup 3.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10078136 Jul  8 00:55 Project 4 FEMST Concrete Strength Predict Backup 4.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10250327 Jul  8 23:23 Project 4 FEMST Concrete Strength Predict Backup 5.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10115330 Jul  9 08:04 Project 4 FEMST Concrete Strength Predict Backup 6.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10298475 Jul 10 19:59 Project 4 FEMST Concrete Strength Predict Backup 7.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10303277 Jul 11 00:25 Project 4 FEMST Concrete Strength Predict Backup 8.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10161012 Jul 11 01:15 Project 4 FEMST Concrete Strength Predict Backup 9.ipynb
-rw-r--r--  1 RiddhiSiddhi  staff  10161012 Jul 11 01:15 Project 4 FEMST Concrete Strength Predict.ipynb